US 20020142277A1

(19) United States 

(12) Patent Application Publication (10) Pub. No.: US 2002/0142277 A1

Burstein et al. (43) Pub. Date: Oct. 3, 2002



(54) METHODS FOR AUTOMATED ESSAY 
ANALYSIS 



(76) Inventors: Jill Burstein, Princeton, NJ (US);
Daniel Marcu, Marina del Rey, CA
(US); Vyacheslav Andreyev, Ewing, NJ
(US); Martin Sanford Chodorow,
New York, NY (US); Claudia
Leacock, New York, NY (US)



Correspondence Address: 
Wilmer, Cutler & Pickering
2445 M Street, NW 
Washington, DC 20037 (US) 



(21) Appl. No.: 10/052,380

(22) Filed: Jan. 23, 2002



Related U.S. Application Data 

(60) Provisional application No. 60/263,223, filed on Jan.
23, 2001.

Publication Classification 

(51) Int. Cl7 G09B 7/00 

(52) U.S. Cl. 434/335



(57) ABSTRACT



An essay is analyzed automatically by accepting the essay 
and determining whether each of a predetermined set of 
features is present or absent in each sentence of the essay. 
For each sentence in the essay a probability that the sentence 
is a member of a certain discourse element category is 
calculated. The probability is based on the determinations of 
whether each feature in the set of features is present or 
absent. Furthermore, based on the calculated probabilities, a 
sentence is chosen as the choice for the discourse element
category. 



Patent Application Publication Oct. 3, 2002 Sheet 1 of 2 US 2002/0142277 A1

FIG. 1 (flowchart of method 100): ACCEPT ESSAY (110); LOOP (115): GET NEXT SENTENCE (120), DETERMINE PRESENCE OR ABSENCE OF EACH FEATURE A1...An (130), COMPUTE PROBABILITY EXPRESSION (140), decision (NO returns to the start of the loop, YES continues); CHOOSE THE SENTENCE WITH MAXIMUM PROBABILITY EXPRESSION (160); DONE.



Patent Application Publication Oct. 3, 2002 Sheet 2 of 2 US 2002/0142277 A1

FIG. 2 (flowchart of process 200): ACCEPT ESSAYS (210); ACCEPT MANUAL ANNOTATION (220); FEATURE DETERMINATION (225): DETERMINE UNIVERSE OF POSITIONAL FEATURES A1...Ak (230), DETERMINE UNIVERSE OF WORD CHOICE FEATURES Ak+1...Am (240), RUN RST PARSER TO DETERMINE UNIVERSE OF RST FEATURES Am+1...An (250); COMPUTE EMPIRICAL PROBABILITIES (260).



METHODS FOR AUTOMATED ESSAY ANALYSIS 

[0001] This application claims priority to U.S. Provisional Patent Application No. 60/263,223, filed Jan. 23, 2001, which is incorporated herein by reference.

FIELD OF THE INVENTION

[0002] This invention relates generally to document processing and automated identification of discourse elements, such as thesis statements, in an essay.

BACKGROUND OF THE INVENTION 

[0003] Given the success of automated essay scoring technology, such applications have been integrated into current standardized writing assessments. The writing community has expressed an interest in the development of essay evaluation systems that include feedback about essay characteristics to facilitate the essay revision process.

[0004] There are many factors that contribute to overall 
improvement of developing writers. These factors include, 
for example, refined sentence structure, variety of appropri- 
ate word usage, and organizational structure. The improve- 
ment of organizational structure is believed to be critical in 
the essay revision process toward overall essay quality. 
Therefore, it would be desirable to have a system that could 
indicate as feedback to students, the discourse elements in 
their essays. 

SUMMARY OF THE INVENTION 

[0005] The invention facilitates the automatic analysis, 
identification and classification of discourse elements in a 
sample of text. 

[0006] In one respect, the invention is a method for automated analysis of an essay. The method comprises the steps of accepting an essay; determining whether each of a predetermined set of features is present or absent in each sentence of the essay; for each sentence in the essay, calculating a probability that the sentence is a member of a certain discourse element category, wherein the probability is based on the determinations of whether each feature in the set of features is present or absent; and choosing a sentence as the choice for the discourse element category, based on the calculated probabilities. The discourse element category of preference is the thesis statement. The essay is preferably in the form of an electronic document, such as an ASCII file. The predetermined set of features preferably comprises the following: a feature based on the position within the essay; a feature based on the presence or absence of certain words wherein the certain words comprise words of belief that are empirically associated with thesis statements; and a feature based on the presence or absence of certain words wherein the certain words comprise words that have been determined to have a rhetorical relation based on the output of a rhetorical structure parser. The calculation of the probabilities is preferably done in the form of a multivariate Bernoulli model.

[0007] In another respect, the invention is a process of training an automated essay analyzer. The training process accepts a plurality of essays and manual annotations demarking discourse elements in the plurality of essays. The training process accepts a set of features that purportedly correlate with whether a sentence in an essay is a particular type of discourse element. The training process calculates empirical probabilities relating to the frequency of the features and relating features in the set of features to discourse elements.

[0008] In yet other respects, the invention is computer 
readable media on which are embedded computer programs 
that perform the above method and process. 

[0009] In comparison to known prior art, certain embodi- 
ments of the invention are capable of achieving certain 
advantages, including some or all of the following: (1) 
eliminating the need for human involvement in providing 
feedback about an essay; (2) improving the timeliness of 
feedback to a writer of an essay; and (3) cross utilization of
automatic essay analysis parameters determined from
essays on a given topic to essays on different topics or 
responding to different questions. Those skilled in the art 
will appreciate these and other advantages and benefits of 
various embodiments of the invention upon reading the 
following detailed description of a preferred embodiment 
with reference to the below-listed drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

[0010] FIG. 1 is a flowchart of a method for providing 
automated essay feedback, according to an embodiment of 
the invention; and 

[0011] FIG. 2 is a flowchart of a process for training the 
automated essay feedback method of FIG. 1, according to 
an embodiment of the invention. 

DETAILED DESCRIPTION OF A PREFERRED
EMBODIMENT 

[0012] I. Overview 

[0013] Using a small corpus of essay data where thesis 
statements have been manually annotated, a Bayesian clas- 
sifier can be built using the following features: a) sentence 
position, b) words commonly used in thesis statements, and 
c) discourse features, based on rhetorical structure theory 
(RST) parses. Experimental results indicate that this classi- 
fication technique may be used toward the automatic iden- 
tification of thesis statements in essays. Furthermore, the 
method generalizes across essay topics. 

[0014] A thesis statement is generally defined as the sentence that explicitly identifies the purpose of the paper or previews its main ideas. This definition seems straightforward enough, and would lead one to believe that identifying the thesis statement in an essay would be clear-cut, even for people. However, this is not always the case. In essays written by developing writers, thesis statements are often not stated so clearly, and ideas are repeated. As a result, human readers sometimes independently choose different thesis statements from the same student essay.

[0015] The value of this system is that it can be used to indicate, as feedback to students, the discourse elements in their essays. Such a system could present to students a guided list of questions to consider about the quality of the discourse. For instance, it has been suggested by writing experts that if the thesis statement of a student's essay could be automatically provided, the student could then use this information to reflect on the thesis statement and its quality. In addition, such an instructional application could utilize the thesis statement to discuss other types of discourse elements in the essay, such as the relationship between the thesis statement and the conclusion, and the connection between the thesis statement and the main points in the essay. In the teaching of writing, students are often presented with a "Revision Checklist." The "Revision Checklist" is intended to facilitate the revision process. This is a list of questions posed to the student that help the student reflect on the quality of their writing. So, for instance, such a list might pose questions such as the following: (a) Is the intention of my thesis statement clear?, (b) Does my thesis statement respond directly to the essay question?, (c) Are the main points in my essay clearly stated?, and (d) Do the main points in my essay relate to my original thesis statement?

[0016] The ability to automatically identify, and present to 
students the discourse elements in their essays can help them 
to focus and reflect on the critical discourse structure of the 
essay. In addition, the ability for the application to indicate 
to the student that a discourse element could not be located, 
perhaps due to the 'lack of clarity' of this element could also 
be helpful. Assuming that such a capability were reliable, 
this would force the writer to think about the clarity of a 
given discourse element, such as a thesis statement. 

[0017] II. Providing Automated Essay Analysis 

[0018] FIG. 1 is a flowchart of a method 100 for providing automated essay analysis, according to an embodiment of the invention. The method 100 estimates which sentence in an essay is most likely to belong to a certain discourse category, such as thesis statement, conclusion, etc. The method 100 begins by accepting (110) an essay. The essay is preferably in electronic form at this step. The method 100 next performs a loop 115. The method 100 makes one pass through the loop 115 for each sentence in the essay. Each pass of the loop 115 gets (120) the next sentence and determines (130) the presence or absence of each feature $A_1 \ldots A_n$ (the features $A_1 \ldots A_n$ having been predetermined to be relevant to the particular discourse category). If more than one discourse category is evaluated, a different set of features $A_1 \ldots A_n$ may be predetermined for each discourse category. The loop 115 next computes (140) a probability expression for each sentence (S) for the discourse category (T) using the formula below.

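$$
\log P(T \mid S) = \log P(T) + \sum_{i=1}^{n}
\begin{cases}
\log\left[ P(A_i \mid T) / P(A_i) \right] & \text{if } A_i \text{ present} \\
\log\left[ P(\bar{A}_i \mid T) / P(\bar{A}_i) \right] & \text{if } A_i \text{ not present}
\end{cases}
$$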



[0019] where $P(T)$ is the prior probability that a sentence is in discourse category T; $P(A_i \mid T)$ is the conditional probability of a sentence having feature $A_i$, given that the sentence is in T; $P(A_i)$ is the prior probability that a sentence contains feature $A_i$; $P(\bar{A}_i \mid T)$ is the conditional probability that a sentence does not have feature $A_i$, given that it is in T; and $P(\bar{A}_i)$ is the prior probability that a sentence does not contain feature $A_i$. Performance can be improved by using a LaPlace estimator to deal with cases when the probability estimates are zero.
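As a rough illustration of the LaPlace estimator mentioned above, the following sketch applies add-one smoothing to a count-based estimate so that no probability is exactly zero. The function name `laplace_estimate` and the choice of two outcomes (feature present or absent) are assumptions made for illustration; the text does not specify the exact form of the estimator.

```python
def laplace_estimate(count, total, num_outcomes=2):
    """Add-one (Laplace) estimate of a probability from counts.

    count:        number of sentences in which the event was observed
    total:        number of sentences considered
    num_outcomes: number of possible outcomes (2 for present/absent)
    """
    return (count + 1) / (total + num_outcomes)


# A feature never seen among 111 thesis sentences still receives a small
# non-zero probability, so the logarithm in the expression stays defined.
unseen = laplace_estimate(0, 111)
seen = laplace_estimate(35, 111)
```

With smoothing of this kind, a feature that never co-occurs with the discourse category lowers a sentence's score without driving it to negative infinity.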

[0020] The method 100 next tests (150) whether the current sentence is the last and loops back to the getting next sentence step 120 if not. After a probability expression has been evaluated for every sentence, the method 100 chooses (160) the sentence with the maximum probability expression for the particular discourse category. The method 100 can be repeated for each different discourse category.
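The loop and selection steps described above can be sketched as follows. This is a minimal sketch, not the patented implementation: the function names, the use of Python sets for the features present in a sentence, and the computation of complement probabilities as one minus the stored value are all assumptions made for illustration.

```python
import math


def probability_expression(present, prior_t, cond, prior):
    """Log probability expression for one sentence (step 140).

    present: set of feature names present in the sentence (step 130)
    prior_t: P(T)
    cond:    dict mapping feature name -> P(A_i | T)
    prior:   dict mapping feature name -> P(A_i)
    Complement probabilities are assumed to be 1 - P.
    """
    score = math.log10(prior_t)
    for name in cond:
        if name in present:
            score += math.log10(cond[name] / prior[name])
        else:
            score += math.log10((1 - cond[name]) / (1 - prior[name]))
    return score


def choose_sentence(sentence_features, prior_t, cond, prior):
    """Score every sentence and return (index of maximum, all scores)."""
    scores = [probability_expression(f, prior_t, cond, prior)
              for f in sentence_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy input: two sentences and one hypothetical feature "a".
best, scores = choose_sentence([set(), {"a"}], 0.1, {"a": 0.8}, {"a": 0.2})
```

In this toy call the second sentence carries the feature that is more common inside the category than overall, so it receives the higher score.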

[0021] Preferably, the accepting step 110 directly accepts the document in an electronic form, such as an ASCII file. In another embodiment, the accepting step 110 comprises the steps of scanning a paper form of the essay and performing optical character recognition on the scanned paper essay.

[0022] In one embodiment, the determining step 130 and computing step 140 iterate through the indexed list of features $A_1 \ldots A_n$ and update the value of the probability expression based on the presence or absence of each feature $A_1 \ldots A_n$. In another embodiment of the determining step 130 and computing step 140, the presence or absence of all features $A_1 \ldots A_n$ could be determined (130) and then the probability expression could be computed (140) for that sentence. Those skilled in the art can appreciate that the steps of the method 100 can be performed in an order different from that illustrated, or simultaneously, in alternative embodiments.

[0023] III. Example of Use

[0024] As an example of the method 100, consider the case when the discourse category is a thesis statement, so that the method 100 estimates which sentence in an essay is most likely to be the thesis statement. Assume that the method 100 utilizes only positional and word occurrence features to identify the thesis statement, as follows:

[0025] $A_1$ = W_FEEL = Occurrence of the word "feel."

[0026] $A_2$ = SP_1 = Being the first sentence in an essay.

[0027] $A_3$ = SP_2 = Being the second sentence in an essay.

[0028] $A_4$ = SP_3 = Being the third sentence in an essay.

[0029] $A_5$ = SP_4 = Being the fourth sentence in an essay.

[0030] Etc. 

[0031] Assume further that the prior and conditional probabilities for these features have been predetermined or otherwise supplied. Typically, these probabilities are determined by a training process (as described in detail below with reference to FIG. 2). For this example, assume that the above features were determined empirically by examining 93 essays containing a grand total of 2391 sentences, of which 111 were denoted by a human annotator as being thesis statements. From this data set, the following prior probabilities were determined by counting frequencies of feature occurrence out of the total number of sentences (where the preceding slash denotes the "not" or complement operator):

[0032] P(THESIS) = 111/2391 = 0.0464

[0033] P(W_FEEL) = 188/2391 = 0.0786

[0034] P(/W_FEEL) = 1 - 0.0786 = 0.9213

[0035] P(SP_1) = 93/2391 = 0.0388



[0036] P(/SP_1) = 1 - 0.0388 = 0.9611

[0037] P(SP_2) = 93/2391 = 0.0388

[0038] P(/SP_2) = 1 - 0.0388 = 0.9611

[0039] P(SP_3) = 93/2391 = 0.0388

[0040] P(/SP_3) = 1 - 0.0388 = 0.9611

[0041] P(SP_4) = 93/2391 = 0.0388

[0042] P(/SP_4) = 1 - 0.0388 = 0.9611

[0043] It can be seen from these numbers, that every essay 
in the training set contained at least four sentences. One 
skilled in the art could continue with additional sentence 
position feature probabilities, but only four are needed in the 
example that follows. 
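The prior probabilities listed above follow directly from the counts given in the text. A minimal sketch, assuming the counts are held in a plain dictionary (the variable names are illustrative only):

```python
TOTAL_SENTENCES = 2391   # sentences in the 93 training essays
THESIS_SENTENCES = 111   # sentences annotated as thesis statements

# Number of sentences exhibiting each feature, as stated in the text.
counts = {"W_FEEL": 188, "SP_1": 93, "SP_2": 93, "SP_3": 93, "SP_4": 93}

p_thesis = THESIS_SENTENCES / TOTAL_SENTENCES
priors = {name: n / TOTAL_SENTENCES for name, n in counts.items()}

# The slash prefix denotes the complement, as in the lists above.
complements = {"/" + name: 1 - p for name, p in priors.items()}
```

The computed values agree with the listed ones to within a unit in the fourth decimal place; which digit is printed depends on whether a value is rounded or truncated.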

[0044] From the same data set, the following conditional 
probabilities were determined by counting frequencies of 
feature occurrence out of the thesis sentences only: 

[0045] P(W_FEEL|THESIS) = 35/111 = 0.3153

[0046] P(/W_FEEL|THESIS) = 1 - 0.3153 = 0.6847

[0047] P(SP_1|THESIS) = 24/111 = 0.2162

[0048] P(/SP_1|THESIS) = 1 - 0.2162 = 0.7838

[0049] P(SP_2|THESIS) = 15/111 = 0.1612

[0050] P(/SP_2|THESIS) = 1 - 0.1612 = 0.8388

[0051] P(SP_3|THESIS) = 13/111 = 0.1171

[0052] P(/SP_3|THESIS) = 1 - 0.1171 = 0.8829

[0053] P(SP_4|THESIS) = 14/111 = 0.1262

[0054] P(/SP_4|THESIS) = 1 - 0.1262 = 0.8739

[0055] With this preliminary data set, the method 100 
begins by reading (110) the following brief essay: 

[0056] Most of the time we as people experience a lot of conflicts in life. We put are selfs in conflict every day by choosing between something that we want to do and something that we feel we should do. For example, I new friends and family that they wanted to go to the army. But they new that if they went to college they were going to get a better education. And now my friends that went to the army tell me that if they had that chance to go back and make that choice again, they will go with the feeling that will make a better choice.

[0057] The method 100 loops through each sentence of the above essay, sentence by sentence. The first sentence, denoted S1, is "Most of the time . . . life." The observed features of S1 are /W_FEEL, SP_1, /SP_2, /SP_3 and /SP_4, as this sentence is the first sentence of the essay and does not contain the word "feel." The probability expression for this sentence is computed (140) as follows:

log [P(T|S1)] = log [P(T)]
    + log [P(/W_FEEL|T)/P(/W_FEEL)]
    + log [P(SP_1|T)/P(SP_1)]
    + log [P(/SP_2|T)/P(/SP_2)]
    + log [P(/SP_3|T)/P(/SP_3)]
    + log [P(/SP_4|T)/P(/SP_4)]
  = log [0.0464]
    + log [0.6847/0.9213]
    + log [0.2162/0.0388]
    + log [0.8388/0.9611]
    + log [0.8829/0.9611]
    + log [0.8739/0.9611]
  = -0.8537

[0058] The second "sentence," denoted S2, is actually two sentences, but the method can treat a group of sentences as a single sentence when, for example, the sentences are related in a certain manner, such as in this case where the second sentence begins with the phrase "For example . . ." Thus, S2 in this example is "We put . . . army." Its features are /SP_1, SP_2, /SP_3, /SP_4 and W_FEEL, as would be determined by the step 130. Computing (140) the probability expression for S2 is done as follows:

log [P(T|S2)] = log [P(T)]
    + log [P(W_FEEL|T)/P(W_FEEL)]
    + log [P(/SP_1|T)/P(/SP_1)]
    + log [P(SP_2|T)/P(SP_2)]
    + log [P(/SP_3|T)/P(/SP_3)]
    + log [P(/SP_4|T)/P(/SP_4)]
  = log [0.0464]
    + log [0.3153/0.0786]
    + log [0.7838/0.9611]
    + log [0.1612/0.0388]
    + log [0.8829/0.9611]
    + log [0.8739/0.9611]
  = -0.2785

[0059] Likewise, for the third sentence, its features are /W_FEEL, /SP_1, /SP_2, SP_3 and /SP_4, and its probability expression value is -1.1717. The probability expression value for the fourth sentence is -1.1760. The maximum probability expression value is -0.2785, corresponding to S2. Thus, the second sentence is chosen (160) as the most likely thesis statement, according to the method 100.
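The arithmetic of the worked example can be checked with a short script. This is only a sketch: the table layout and the function name `expression` are assumptions, while the probability values are copied from the lists in the example.

```python
import math

P_T = 0.0464  # P(THESIS)

# feature -> (P(A|T), P(A), P(/A|T), P(/A)), copied from the example
PROBS = {
    "W_FEEL": (0.3153, 0.0786, 0.6847, 0.9213),
    "SP_1": (0.2162, 0.0388, 0.7838, 0.9611),
    "SP_2": (0.1612, 0.0388, 0.8388, 0.9611),
    "SP_3": (0.1171, 0.0388, 0.8829, 0.9611),
    "SP_4": (0.1262, 0.0388, 0.8739, 0.9611),
}


def expression(present):
    """Base-10 log probability expression for a set of present features."""
    total = math.log10(P_T)
    for name, (cond, prior, not_cond, not_prior) in PROBS.items():
        if name in present:
            total += math.log10(cond / prior)
        else:
            total += math.log10(not_cond / not_prior)
    return total


s1 = expression({"SP_1"})            # first sentence, no "feel"
s2 = expression({"W_FEEL", "SP_2"})  # second sentence, contains "feel"
s3 = expression({"SP_3"})            # third sentence, no "feel"
best = max([("S1", s1), ("S2", s2), ("S3", s3)], key=lambda x: x[1])[0]
```

Run as written, the three values agree with -0.8537, -0.2785 and -1.1717 to about three decimal places, and S2 has the largest value.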

[0060] Note that the prior probability term P(T) is the same for every sentence; thus, this term can be ignored for purposes of the method 100 for a given discourse category. Note also that while the preceding calculations were performed using base-10 logarithms, any base (e.g., natural logarithm, ln) can be used instead, provided the same base logarithm is used consistently.
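Both observations can be illustrated with a few lines of code. The sketch below ranks three sentences twice, once with base-10 logarithms and the prior term, and once with natural logarithms and no prior term. The per-sentence ratio products are made-up illustrative values, not data from the example.

```python
import math

# Product of the likelihood ratios for each sentence (hypothetical values).
ratio_products = [0.41, 1.53, 0.07]
prior_t = 0.0464


def ranking(log, include_prior):
    """Return sentence indices ordered from highest to lowest score."""
    offset = log(prior_t) if include_prior else 0.0
    scores = [log(r) + offset for r in ratio_products]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


rank_base10 = ranking(math.log10, include_prior=True)
rank_natural = ranking(math.log, include_prior=False)
```

Adding a constant to every score, or multiplying every score by a positive constant (which is what a change of logarithm base does), leaves the ordering unchanged, so both calls select the same sentence.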

[0061] IV. Constructing the Automatic Essay Analyzer 

[0062] FIG. 2 is a flowchart of a process 200 for training the method 100, according to an embodiment of the invention. The process 200 begins by accepting (210) a plurality of essays. The essays are preferably in electronic form at this step. The method 200 then accepts (220) manual annotations. The method 200 then determines (225) the universe of all possible features $A_1 \ldots A_n$. Finally, method 200 computes (260) the empirical probability relating to each feature $A_i$ across the plurality of essays.

[0063] The preferred method of accepting (210) the plurality of essays is in the form of stored or directly entered electronic documents, and the preferred electronic format is ASCII. Alternatively or additionally, the essays could be accepted (210) utilizing a method comprised of the steps of scanning the paper forms of the essays and performing optical character recognition on the scanned paper essays.

[0064] The preferred method of accepting (220) manual annotations is in the form of electronic text essays that have been manually annotated by humans skilled in the art of discourse element identification. The preferred method of indicating the manual annotation of the pre-specified discourse elements is by the bracketing of discourse elements within starting and ending "tags" (e.g. <Sustained Idea> . . . </Sustained Idea>, <Thesis Statement> . . . </Thesis Statement>).
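One way to read such bracketed annotations is with a regular expression that pairs each starting tag with the ending tag of the same name. This is a sketch under stated assumptions: the function name `extract_annotations` is hypothetical, annotations are assumed not to be nested, and the sample text is invented for illustration.

```python
import re

# Match <Label> ... </Label>, requiring the closing label to equal the
# opening one. DOTALL lets an annotation span several lines.
TAG_PATTERN = re.compile(
    r"<(?P<label>[^<>/]+)>(?P<text>.*?)</(?P=label)>", re.DOTALL
)


def extract_annotations(annotated_essay):
    """Return (label, text) pairs for every bracketed discourse element."""
    return [(m.group("label"), m.group("text").strip())
            for m in TAG_PATTERN.finditer(annotated_essay)]


sample = (
    "<Thesis Statement>We put ourselves in conflict every day."
    "</Thesis Statement> Some other sentence. "
    "<Sustained Idea>For example, my friends.</Sustained Idea>"
)
annotations = extract_annotations(sample)
```

Text outside any pair of tags is simply skipped, which matches the idea that only demarked discourse elements are of interest during training.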

[0065] The preferred embodiment of method 200 then determines (225) the universe of all possible features for a particular discourse item. The feature determination step 225 begins by determining (230) the universe of positional features $A_1 \ldots A_k$. Next, the feature determination step 225 determines (240) the universe of word choice features $A_{k+1} \ldots A_m$. Finally, the feature determination step 225 determines (250) the universe of rhetorical structure theory (RST) features $A_{m+1} \ldots A_n$.

[0066] An embodiment of the positional features determi- 
nation step 230 loops through each essay in the plurality of 
essays, noting the position of demarked discourse elements 
within each essay and determining the number of sentences 
in that essay. 

[0067] An embodiment of the word choice features determination step 240 parses the plurality of essays and creates a list of all words contained within the sentences marked by a human annotator as being a thesis statement. Alternatively or additionally, the word choice features $A_{k+1} \ldots A_m$ universe determination step 240 can accept a predetermined list of words of belief, words of opinion, etc.
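The word list construction in step 240 might look like the sketch below. The tokenization rule (lowercased runs of letters and apostrophes), the function name `thesis_vocabulary`, and the sample sentences are assumptions for illustration; the text does not say how words are delimited.

```python
import re


def thesis_vocabulary(thesis_sentences, extra_words=()):
    """Collect one entry per distinct word used in thesis sentences.

    extra_words: an optional predetermined list (e.g. words of belief)
                 that is merged into the vocabulary.
    """
    vocab = set(w.lower() for w in extra_words)
    for sentence in thesis_sentences:
        vocab.update(re.findall(r"[a-z']+", sentence.lower()))
    return sorted(vocab)


vocab = thesis_vocabulary(
    ["I feel that college is better.", "I feel we should choose."],
    extra_words=["opinion"],
)
```

Each distinct word then becomes one word choice feature whose presence or absence is tested in every sentence.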

[0068] An embodiment of the RST (rhetorical structure theory) features determination step 250 parses the plurality of essays to extract pertinent features. The RST parser of preference utilized in step 250 is described in Marcu, D., "The Rhetorical Parsing of Natural Language Texts," Proceedings of the 35th Annual Meeting of the Assoc. for Computational Linguistics, 1997, pp. 96-103, which is hereby incorporated by reference. Further background on RST is available in Mann, W. C. and S. A. Thompson, "Rhetorical Structure Theory: Toward a Functional Theory of Text Organization," Text 8(3), 1988, pp. 243-281, which is also hereby incorporated by reference.

[0069] For each discourse element, the method 200 computes (260) the empirical frequencies relating to each feature $A_i$ across the plurality of essays. For a sentence (S) in the discourse category (T) the following probabilities are determined for each $A_i$: $P(T)$, the prior probability that a sentence is in discourse category T; $P(A_i \mid T)$, the conditional probability of a sentence having feature $A_i$, given that the sentence is in T; $P(A_i)$, the prior probability that a sentence contains feature $A_i$; $P(\bar{A}_i \mid T)$, the conditional probability that a sentence does not have feature $A_i$, given that it is in T; and $P(\bar{A}_i)$, the prior probability that a sentence does not contain feature $A_i$.
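The counting in step 260 can be sketched as below. The input representation (a list of pairs holding the set of features present in a sentence and a flag saying whether the sentence was annotated as belonging to T), the function name, and the toy data are assumptions for illustration. The sketch also assumes at least one sentence is in T.

```python
def empirical_probabilities(sentences, feature_names):
    """Estimate the training probabilities by counting.

    sentences: list of (features_present, is_in_T) pairs, where
               features_present is a set of feature names.
    Returns P(T) and, per feature, P(A|T), P(A), P(/A|T), P(/A).
    """
    total = len(sentences)
    in_t = [feats for feats, is_t in sentences if is_t]
    p_t = len(in_t) / total
    table = {}
    for name in feature_names:
        p_a = sum(name in feats for feats, _ in sentences) / total
        p_a_given_t = sum(name in feats for feats in in_t) / len(in_t)
        table[name] = {
            "P(A|T)": p_a_given_t,
            "P(A)": p_a,
            "P(/A|T)": 1 - p_a_given_t,
            "P(/A)": 1 - p_a,
        }
    return p_t, table


# Toy training data: four sentences, one of them annotated as a thesis.
toy = [({"W_FEEL", "SP_1"}, True), ({"SP_2"}, False),
       ({"W_FEEL"}, False), (set(), False)]
p_t, table = empirical_probabilities(toy, ["W_FEEL", "SP_1", "SP_2"])
```

In the toy data the feature SP_2 never occurs in a thesis sentence, so its conditional probability is zero; this is the situation in which a smoothed estimate would be needed before taking logarithms.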

[0070] The method 100 and the process 200 can be performed by computer programs. The computer programs can exist in a variety of forms both active and inactive. For example, the computer programs can exist as software program(s) comprised of program instructions in source code, object code, executable code or other formats; firmware program(s); or hardware description language (HDL) files. Any of the above can be embodied on a computer readable medium, which includes storage devices and signals, in compressed or uncompressed form. Exemplary computer readable storage devices include conventional computer system RAM (random access memory), ROM (read only memory), EPROM (erasable, programmable ROM), EEPROM (electrically erasable, programmable ROM), and magnetic or optical disks or tapes. Exemplary computer readable signals, whether modulated using a carrier or not, are signals that a computer system hosting or running the computer programs can be configured to access, including signals downloaded through the Internet or other networks. Concrete examples of the foregoing include distribution of executable software program(s) of the computer program on a CD ROM or via Internet download. In a sense, the Internet itself, as an abstract entity, is a computer readable medium. The same is true of computer networks in general.

[0071] V. Experiments Using the Automated Essay Ana- 
lyzer 

[0072] A. Experiment 1: Baseline

[0073] Experiment 1 utilizes a Bayesian classifier for thesis statements using essay responses to one English Proficiency Test (EPT) question: Topic B. The results of this experiment suggest that automated methods can be used to identify the thesis statement in an essay. In addition, the performance of the classification method, given even a small set of manually annotated data, appears to approach human performance, and exceeds baseline performance.

[0074] In collaboration with two writing experts, a simple discourse-based annotation protocol was developed to manually annotate discourse elements in essays for a single essay topic. This was the initial attempt to annotate essay data using discourse elements generally associated with essay structure, such as thesis statement, concluding statement, and topic sentences of the essay's main ideas. The writing experts defined the characteristics of the discourse labels. These experts then completed the subsequent annotations using a PC-based interface implemented in Java.

[0075] Table 1 indicates agreement between two human annotators for the labeling of thesis statements. In addition, the table shows the baseline performance in two ways. Thesis statements commonly appear at the very beginning of an essay. So, we used a baseline method where the first sentence of each essay was automatically selected as the thesis statement. This position-based selection was then compared to the resolved human annotator thesis selection (i.e., final annotations agreed upon by the two human annotators) for each essay (Position-Based&H). In addition, random thesis statement selections were compared with humans 1 and 2, and the resolved thesis statement (Random&H). The % Overlap column in Table 1 indicates the percentage of the time that the two annotators selected the exact same text as the thesis statement. Kappa between the two human annotators was 0.733. This indicates good agreement between human annotators. This kappa value suggests that the task of manual selection of thesis statements was well-defined.



TABLE 1

Annotators          % Overlap
1&2                 53.0%
Position-Based&H    24.0%
Random&H            7.0%
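The two agreement measures reported for the annotators can be sketched as follows. The text does not state how kappa was computed, so the sketch assumes Cohen's kappa over per-sentence labels; the function names and the toy label sequences are illustrative only.

```python
def percent_overlap(selections_1, selections_2):
    """Share of essays where both annotators chose the exact same text."""
    same = sum(a == b for a, b in zip(selections_1, selections_2))
    return same / len(selections_1)


def cohen_kappa(labels_1, labels_2):
    """Cohen's kappa for two equal-length lists of category labels."""
    n = len(labels_1)
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    categories = set(labels_1) | set(labels_2)
    # Agreement expected by chance, from each annotator's label rates.
    expected = sum((labels_1.count(c) / n) * (labels_2.count(c) / n)
                   for c in categories)
    return (observed - expected) / (1 - expected)


# Toy per-sentence labels: 1 = marked as thesis, 0 = not marked.
ann_1 = [1, 0, 0, 0, 1, 0, 0, 0]
ann_2 = [1, 0, 0, 0, 0, 0, 1, 0]
kappa = cohen_kappa(ann_1, ann_2)
overlap = percent_overlap(["s1", "s4"], ["s1", "s2"])
```

Kappa corrects the raw agreement rate for agreement expected by chance, which matters here because most sentences are not thesis statements and would be labeled alike by almost any pair of annotators.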



[0076] B. Experiment 2

[0077] Experiment 2 utilized three general feature types to build the classifier: a) sentence position, b) words commonly occurring in a thesis statement, and c) RST labels from outputs generated by an existing rhetorical structure parser (Marcu, 1997). The classifier was trained to predict thesis statements in an essay. Using the multivariate Bernoulli formula, this gives us the log probability that a sentence (S) in an essay belongs to the class (T) of sentences that are thesis statements.

[0078] Experiment 2 utilized three kinds of features to build the classifier. These were a) positional, b) lexical, and c) Rhetorical Structure Theory-based discourse features (RST). With regard to the positional feature, we found that in the human annotated data, the annotators typically marked a sentence as being a thesis toward the beginning of the essay. So, sentence position was a relevant feature. With regard to lexical information, our research indicated that using as features the words in sentences annotated as thesis statements also proved to be useful toward the identification of a thesis statement. In addition, information from RST-based parse trees is or can be useful.

[0079] Two kinds of lexical features were used in Experiment 2: a) the thesis word list, and b) the belief word list. For the thesis word list, we included lexical information in thesis statements in the following way to build the thesis statement classifier. For the training data, a vocabulary list was created that included one occurrence of each word used in a thesis statement (in training set essays). All words in this list were used as a lexical feature to build the thesis statement classifier. Since we found that our results were better if we used all words used in thesis statements, no stop list was used. The belief word list included a small dictionary of approximately 30 words and phrases, such as opinion, important, better, and in order that. These words and phrases were common in thesis statement text. The classifier was trained on this set of words, in addition to the thesis word vocabulary list.

[0080] According to RST, one can associate a rhetorical structure tree to any text. The leaves of the tree correspond to elementary discourse units and the internal nodes correspond to contiguous text spans. Text spans are represented at the clause and sentence level. Each node in a tree is characterized by a status (nucleus or satellite) and a rhetorical relation, which is a relation that holds between two non-overlapping text spans. The distinction between nuclei and satellites comes from the empirical observation that the nucleus expresses what is more essential to the writer's intention than the satellite; and that the nucleus of a rhetorical relation is comprehensible independent of the satellite, but not vice versa. When spans are equally important, the relation is multinuclear. Rhetorical relations reflect semantic, intentional, and textual relations that hold between text spans. For example, one text span may elaborate on another text span; the information in two text spans may be in contrast; and the information in one text span may provide background for the information presented in another text span. The algorithm considers two pieces of information from RST parse trees in building the classifier: a) is the parent node for the sentence a nucleus or a satellite, and b) what elementary discourse units are associated with thesis versus non-thesis sentences.

[0081] In Experiment 2, we examined how well the algorithm performed compared to the agreement of two human judges, and the baselines in Table 1. Table 2 indicates performance for 6 cross-validation runs. In these runs, 5/6 of the data were used for training and 1/6 for subsequent cross-validation. Agreement is evaluated on the 1/6 of the data. For this experiment, inclusion of the following features to build the classifier yielded the results in Table 2: a) sentence position, b) both RST feature types, and c) the thesis word list. We applied this cross-validation method to the entire data set (All), where the training sample contained 78 thesis statements, and to a gold-standard set where 49 essays (GS) were used for training. The gold-standard set includes essays where human readers agreed on annotations independently. The evaluation compares agreement between the algorithm and the resolved annotation (A&Res), human annotator 1 and the resolved annotation (1&Res), and human annotator 2 and the resolved annotation (2&Res). "% Overlap" in Table 2 refers to the percentage of the time that there is exact overlap in the text of the two annotations. The results exceed both baselines in Table 1.

TABLE 2

Mean percent overlap for 6 cross-validation runs.

Annotators    N      Matches   % Overlap Agreement
All: A&Res    15.5   7.7       50.0
GS: A&Res     9      5.0       56.0
1&Res         15.5   9.9       64.0
2&Res         15.5   9.7       63.0
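
The six-run procedure described for Table 2 can be sketched as follows. This is a minimal illustration, not the procedure actually used: the round-robin fold assignment, the function names, and the corpus size of 93 are assumptions (93 is chosen only because it gives a mean held-out size of 15.5, the N reported in Table 2).

```python
from typing import List, Sequence, Tuple


def six_fold_splits(items: Sequence, k: int = 6) -> List[Tuple[list, list]]:
    """Return k (train, held_out) pairs; each item is held out exactly once."""
    # Round-robin assignment to folds is an assumption made for illustration.
    folds = [list(items[i::k]) for i in range(k)]
    splits = []
    for held in range(k):
        train = [x for j, fold in enumerate(folds) if j != held for x in fold]
        splits.append((train, folds[held]))
    return splits


def percent_overlap(matches: float, n: float) -> float:
    """Percentage of held-out essays on which two selections match exactly."""
    return 100.0 * matches / n


essay_ids = list(range(93))  # hypothetical corpus size, see note above
splits = six_fold_splits(essay_ids)
mean_held_out = sum(len(held) for _, held in splits) / len(splits)

# Mean matches and mean N for the "All: A&Res" row of Table 2.
all_a_res = percent_overlap(7.7, 15.5)
```

Each run trains on roughly 5/6 of the essays and evaluates on the remaining 1/6. Applying `percent_overlap` to the mean matches and mean N in each row of Table 2 gives values that round to the whole-number percentages reported there.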



C. Experiment 3



[0082] A next experiment shows that thesis statements in
essays appear to be characteristically different from a summary
sentence in essays, as they have been identified by
human annotators.

[0083] For the Topic B data from Experiment 1, two
human annotators used the same PC-based annotation interface
in order to annotate one-sentence summaries of essays.
A new labeling option was added to the interface for this task
called "Summary Sentence". These annotators had not seen
these essays previously, nor had they participated in the
previous annotation task. Annotators were asked to independently
identify a single sentence in each essay that was
the summary sentence in the essay.

[0084] The kappa values for the manual annotation of
thesis statements (Th) as compared to that of summary
statements (SumSent) show that the former task is much
more clearly defined. We see that the kappa of 0.603 does
not show strong agreement between annotators for the
summary sentence task. For the thesis annotation task, the
kappa was 0.733, which shows good agreement between
annotators. In Table 3, the results strongly indicate that there
was very little overlap in each essay between what human
annotators had labeled as thesis statements in the initial task,
and what had been annotated as a summary sentence
(Th/SumSent Overlap). This strongly suggests that there are
critical differences between thesis statements and summary
sentences in essays that we are interested in exploring
further. Of interest is that some preliminary data indicated
that what annotators marked as summary sentences appear
to be more closely related to concluding statements in essays.

TABLE 3

Kappa and Percent Overlap Between
Manual Thesis Selections (Th) and Summary Statements (SumSent)

             Th      SumSent   Th/SumSent Overlap
Kappa        .733    .603      N/A
% Overlap    .53     .41       .06
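
For readers unfamiliar with the kappa statistic reported in Table 3, the following sketch shows one common way to compute chance-corrected agreement between two annotators. It assumes Cohen's kappa over per-sentence labels; the text here does not say which kappa variant was used, and the toy labels below are invented for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if the annotators labeled independently
    # with their own marginal label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example: 1 = sentence marked as thesis statement, 0 = not marked.
annotator_1 = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)  # approximately 0.375
```

In the toy example the annotators agree on 8 of 10 sentences, but most of that agreement is expected by chance because most sentences are unmarked, so kappa is much lower than the raw 80% agreement.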



[0085] From the results in Table 3, we can infer that thesis 
statements in essays are a different genre than, say, a 
problem statement in journal articles. From this perspective, 
the thesis classification algorithm appears to be appropriate 
for the task of automated thesis statement identification. 

[0086] D. Experiment 4

[0087] How does the algorithm generalize across topics?
The next experiment tests the generalizability of the thesis
selection method. Specifically, this experiment answers the
question whether there were positional, lexical, and discourse
features that underlie a thesis statement, and whether
or not they were topic independent. If so, this would indicate
an ability to annotate thesis statements across a number of
topics, and re-use the algorithm on additional topics, without
further annotation. A writing expert manually annotated the
thesis statement in approximately 45 essays for 4 additional
topics: Topics A, C, D and E. She completed this task using
the same interface that was used by the two annotators in
Experiment 1. The results of this experiment suggest that the
positional, lexical, and discourse structure features applied
in Experiments 1 and 2 are generalizable across essay topic.

[0088] To test the generalizability of the method, for each
EFT topic the thesis sentences selected by a writing expert
were used for building the classifier. Five combinations of
four prompts were used to build the classifier in each case,
and that classifier was then cross-validated on the fifth topic,
not used to build the classifier. To evaluate the performance
of each of the classifiers, agreement was calculated for each
'cross-validation' sample (single topic) by comparing the
algorithm selection to our writing expert's thesis statement
selection. For example, we trained on Topics A, B, C, and D,
using the thesis statements selected manually. This classifier
was then used to select, automatically, thesis statements for
Topic E. In the evaluation, the algorithm's selection was
compared to the manually selected set of thesis statements
for Topic E, and agreement was calculated. Exact matches
for each run are presented in Table 4. In all but one case,
agreement exceeds both baselines from Table 1. In two
cases, where the percent overlap was lower, on cross-validation
(Topics A and B), we were able to achieve higher
overlap using the vocabulary in belief word list as features,
in addition to the thesis word list vocabulary. In the case of
Topic A, we achieved higher agreement only when adding
the belief word list feature and applying the classical Bayes
approach (see footnote 2). Agreement was 34% (17/50) for
Topic B, and 31% (16/51) for Topic A.



TABLE 4

Performance on a Single Cross-validation Topic (CV Topic)
Using Four Unique Essay Topics for Training

Training Topics   CV Topic   N    Matches   % Overlap
ABCD              E          47   19        40.0
ABCE              D          47   22        47.0
ABDE              C          31   13        42.0
ACDE              B          50   15        30.0
BCDE              A          51   12        24.0
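
The train-on-four, test-on-one design behind Table 4 can be sketched as a leave-one-topic-out loop. The sketch below only enumerates the five training/held-out combinations and recomputes percent overlap from the Matches and N columns of Table 4; the function and variable names are illustrative, and no classifier is trained here.

```python
from typing import Dict, Iterator, List, Tuple

TOPICS = ["A", "B", "C", "D", "E"]

# (matches, N) for each held-out topic, copied from Table 4.
TABLE_4: Dict[str, Tuple[int, int]] = {
    "E": (19, 47),
    "D": (22, 47),
    "C": (13, 31),
    "B": (15, 50),
    "A": (12, 51),
}


def leave_one_topic_out(topics: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training_topics, held_out_topic) once for every topic."""
    for held_out in topics:
        training = [t for t in topics if t != held_out]
        yield training, held_out


# Percent overlap per held-out topic, rounded to a whole number.
overlap = {}
for training, held_out in leave_one_topic_out(TOPICS):
    matches, n = TABLE_4[held_out]
    overlap[held_out] = round(100 * matches / n)
    print("".join(training), "->", held_out, overlap[held_out])
```

The rounded values agree with the "% Overlap" column of Table 4, which suggests that column is simply Matches divided by N, expressed as a whole-number percentage.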



[0089] The experiments described above indicate the following:
With a relatively small corpus of manually annotated
essay data, a multivariate Bernoulli approach can be
used to build a classifier using positional, lexical and discourse
features. This algorithm can be used to automatically
select thesis statements in essays. Results from both experiments
indicate that the algorithm's selection of thesis statements
agrees with a human judge almost as often as two
human judges agree with each other. Kappa values for
human agreement suggest that the task for manual annotation
of thesis statements in essays is reasonably well-defined.
We are refining the current annotation protocol so
that it defines even more clearly the labeling task. We expect
that this will increase human agreement in future annotations,
and the reliability of the automatic thesis selection
since the classifiers are built using the manually annotated
data.
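
As an illustration of the multivariate Bernoulli approach summarized above, the following sketch estimates per-feature probabilities from annotated sentences, scores each sentence with the sum of log ratios recited in claim 13, and picks the highest-scoring sentence as in claim 14. It is a minimal sketch under stated assumptions, not the implementation used in the experiments: the add-one smoothing formula is one reading of the "LaPlace estimator" of claim 15, and the three binary features and the toy training data are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

FeatureVector = Sequence[bool]  # presence/absence of features A_1..A_n


def train(sentences: List[Tuple[FeatureVector, bool]]) -> Dict[str, List[float]]:
    """Estimate P(A_i | T) and P(A_i) with add-one (Laplace) smoothing."""
    n_features = len(sentences[0][0])
    in_class = [feats for feats, is_thesis in sentences if is_thesis]
    p_given_t, p_prior = [], []
    for i in range(n_features):
        count_t = sum(1 for feats in in_class if feats[i])
        count_all = sum(1 for feats, _ in sentences if feats[i])
        # Add 1 to the count and 2 to the total (two outcomes: present/absent).
        p_given_t.append((count_t + 1) / (len(in_class) + 2))
        p_prior.append((count_all + 1) / (len(sentences) + 2))
    return {"p_given_t": p_given_t, "p_prior": p_prior}


def score(feats: FeatureVector, model: Dict[str, List[float]]) -> float:
    """Sum of log[P(A_i|T)/P(A_i)] if present, else log[P(not A_i|T)/P(not A_i)]."""
    total = 0.0
    for present, p_t, p in zip(feats, model["p_given_t"], model["p_prior"]):
        if present:
            total += math.log(p_t / p)
        else:
            total += math.log((1.0 - p_t) / (1.0 - p))
    return total


def choose_sentence(essay: List[FeatureVector], model: Dict[str, List[float]]) -> int:
    """Index of the sentence with the largest score."""
    return max(range(len(essay)), key=lambda k: score(essay[k], model))


# Hypothetical features: (in first paragraph, has thesis-list word, RST nucleus).
training = [
    ((True, True, True), True),
    ((True, False, True), True),
    ((False, False, True), False),
    ((False, True, False), False),
    ((False, False, False), False),
    ((True, False, False), False),
]
model = train(training)
essay = [(False, False, True), (True, True, True), (False, True, False)]
best = choose_sentence(essay, model)
```

Because the smoothed estimates never reach 0 or 1, every logarithm in `score` is defined even for features that never occur in the annotated thesis sentences, which matters with a small training corpus.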

[0090] The experiments also provide evidence that this
method for automated thesis selection in essays is generalizable.
That is, once trained on a few human annotated
prompts, it could be applied to other prompts given a similar
population of writers, in this case, writers at the college
freshman level. The larger implication is that we begin to see
that there are underlying discourse elements in essays that
can be identified, independent of the topic of the test
question. For essay evaluation applications this is critical
since new test questions are continuously being introduced
into on-line essay evaluation applications. It would be too
time-consuming and costly to repeat the annotation process
for all new test questions.

[0091] V. Conclusion 

[0092] What has been described and illustrated herein is a 
preferred embodiment of the invention along with some of 
its variations. The terms, descriptions and figures used 
herein are set forth by way of illustration only and are not 
meant as limitations. Those skilled in the art will recognize 
that many variations are possible within the spirit and scope 
of the invention, which is intended to be defined by the 
following claims — and their equivalents — in which all terms 
are meant in their broadest reasonable sense unless other- 
wise indicated. 

What is claimed is: 

1. A method for automated analysis of an essay, the 
method comprising: 

accepting an essay; 

determining whether each of a predetermined set of 
features is present or absent in each sentence of the 
essay; 

for each sentence in the essay, calculating a probability
that the sentence is a member of a certain discourse
element category, wherein the probability is based on
the determinations of whether each feature in the set of
features is present or absent; and

choosing a sentence as the choice for the discourse
element category, based on the calculated probabilities.

2. The method of claim 1 wherein the discourse element 
category is thesis statement. 

3. The method of claim 1 wherein the essay is in an 
electronic form. 

4. The method of claim 3 wherein the essay is an ASCII 
file. 

5. The method of claim 1 wherein the accepting step 
comprises: 

scanning a paper form of the essay; and 

performing optical character recognition on the scanned
paper essay.

6. The method of claim 1 wherein the predetermined set 
of features comprises: 

a feature based on position within the essay. 

7. The method of claim 1 wherein the predetermined set 
of features comprises: 

a feature based on presence or absence of certain words. 

8. The method of claim 7 wherein the certain words 
comprise words empirically associated with thesis state- 
ments. 

9. The method of claim 7 wherein the certain words 
comprise words of belief. 

10. The method of claim 1 wherein the predetermined set 
of features comprises: 

a feature based on rhetorical relation. 

11. The method of claim 10 wherein the determining step
comprises:

parsing the essay using a rhetorical structure parser.

12. The method of claim 1 wherein the calculating step 
comprises: 

utilizing a multivariate Bernoulli model. 

13. The method of claim 12 wherein the calculating step
calculates the following quantity for each sentence:

\[
\sum_{i}
\begin{cases}
\log\left[ P(A_i \mid T) / P(A_i) \right] & \text{if } A_i \text{ present} \\
\log\left[ P(\bar{A}_i \mid T) / P(\bar{A}_i) \right] & \text{if } A_i \text{ not present}
\end{cases}
\]

wherein

$P(A_i \mid T)$ is a conditional probability that a sentence has a
feature $A_i$ given that the sentence is in a class $T$;

$P(\bar{A}_i \mid T)$ is a conditional probability that a sentence does
not have a feature $A_i$ given that the sentence is in a class
$T$;

$P(A_i)$ is a prior probability that a sentence contains a
feature $A_i$; and

$P(\bar{A}_i)$ is a prior probability that a sentence does not
contain a feature $A_i$.

14. The method of claim 13 wherein the choosing step 
comprises: 

choosing the sentence for which the quantity is the largest. 

15. The method of claim 1 wherein the calculating step 
comprises: 

utilizing a LaPlace estimator. 

16. The method of claim 1 further comprising: 

providing an essay question, the essay being an answer to 
the essay question. 

17. The method of claim 1 further comprising: 

repeating the calculating and choosing steps for one or 
more different discourse element categories. 

18. The method of claim 1 further comprising: 

outputting the choice.

19. The method of claim 1 further comprising: 
outputting a revision checklist. 

20. A process of training an automated essay analysis 
method, the process comprising:

accepting a plurality of essays; 

accepting manual annotations demarking discourse ele- 
ments in each of the plurality of essays; 

accepting a set of features that purportedly correlate with 
whether a sentence in an essay is a particular type of 
discourse element; 

calculating empirical probabilities relating to the fre- 
quency of the features; and 

calculating empirical probabilities relating features in the 
set of features to discourse elements. 

21. The process of claim 20 further comprising: 

performing the method of claim 1 on each of the plurality 
of essays; and

judging the performance of the method of claim 1 as 
compared to the manual annotations; and 

if the performance of the method of claim 1 is inadequate, 
modifying the set of features and repeating the method 
of claim 1. 

22. A computer readable medium on which is embedded 
a computer program, the computer program performing the 
method of claim 1. 

23. A computer readable medium on which is embedded
a computer program, the computer program performing the 
process of claim 20. 

***** 



